Week 9.3 - Where AI Is Now Genuinely Strong

🎯 What We'll Cover

The trajectory frame from 9.1 cuts both ways. If you arrive thinking AI is more capable than it is, you over-rely on it. If you arrive thinking AI is less capable than it is, you miss things you could be doing. Both errors are common. This sub-lesson focuses on the second.

We look at concrete cases from 2025–26 where AI played a pivotal role in research-grade work in mathematics, theoretical physics, and beyond. The examples are deliberately drawn from fields where, two years ago, the consensus was that AI couldn't contribute meaningfully. The point is not that AI replaces researchers in these fields — in every case, human collaboration was essential and human verification was mandatory — but that the ceiling has moved.

A pedagogical caveat to flag now: these are selection-biased examples. They are the cases where AI worked. The ratio of attempts-to-successes is unknown. We close the sub-lesson with the caveats; read them.

➗ Mathematics

If one field has decisively shifted in 2026, it is mathematics. The combination of frontier reasoning models (GPT-5.4 Pro, GPT-5.5 Pro, Gemini Deep Think) and tools that pair them with formal verification or automated evaluation (AlphaProof, AlphaEvolve) has produced a stream of contributions to research-level problems.

📝 Erdős Problems Solved with Frontier Models

Since January 2026, 15 Erdős problems have moved from “open” to “solved” on the canonical Erdős problem database, with 11 of those crediting AI models as part of the process. The contributions are tracked at github.com/teorth/erdosproblems.

Erdős Problem #1196. A 1968 conjecture from Erdős, Sárközy and Szemerédi on primitive sets. Solved by GPT-5.4 Pro on 13 April 2026, with a solution that includes a Lean formalisation. A peer-reviewed-track follow-up paper extends the result substantially: Alexeev, B., Barreto, K., Li, Y., Lichtman, J. D., Price, L., Shah, J. I., Tang, Q., & Tao, T. (3 May 2026). Primitive sets and von Mangoldt chains: Erdős Problem #1196 and beyond. arXiv:2605.00301.
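For readers new to the term: a primitive set is a set of positive integers in which no element divides another. The sketch below only illustrates that definition on small examples. It is not the statement of Problem #1196 and has nothing to do with the proof; the function name `is_primitive` is ours.

```python
from itertools import combinations

def is_primitive(numbers):
    """Return True if no element divides a different element.

    Assumes all inputs are positive integers.
    """
    values = sorted(set(numbers))
    # After sorting, only the smaller member of a pair can divide the larger.
    return all(b % a != 0 for a, b in combinations(values, 2))

print(is_primitive([2, 3, 5, 7]))  # distinct primes never divide each other
print(is_primitive([6, 10, 15]))   # composites can form a primitive set too
print(is_primitive([2, 3, 6]))     # 2 divides 6, so this set is not primitive
```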

Erdős Problem #728. Solved by GPT-5.4 Pro in approximately 80 minutes from a single prompt. Tao's commentary: “the proof reveals a previously undescribed connection between the anatomy of integers and Markov process theory.” This is a meaningful contribution to the field that goes well beyond the specific problem.

📝 Mathematical Exploration at Scale

Georgiev, B., Gómez-Serrano, J., Tao, T., & Wagner, A. Z. (3 November 2025). Mathematical exploration and discovery at scale. arXiv:2511.02864

Tao and collaborators tested Google's AlphaEvolve — an evolutionary coding agent that combines an LLM with automated evaluation — on 67 mathematical problems across analysis, combinatorics, geometry, and number theory. The system rediscovered established solutions in most cases and identified improved solutions on several. The authors integrated AlphaEvolve with Deep Think and AlphaProof for proof generation. Tao's blog post on the paper at terrytao.wordpress.com is a useful accessible summary.
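The phrase "evolutionary coding agent that combines an LLM with automated evaluation" describes a propose, evaluate, select loop. The toy sketch below shows the shape of such a loop under heavy assumptions: a random perturbation stands in for the LLM proposal step, the candidates are plain numbers rather than programs, and the objective is invented for illustration. None of the names here come from AlphaEvolve.

```python
import random

def evaluate(candidate):
    """Automated evaluator: higher is better. Toy objective, maximum at 3."""
    return -(candidate - 3.0) ** 2

def propose(parent, rng):
    """Stand-in for the LLM proposal step: a small random perturbation."""
    return parent + rng.gauss(0.0, 0.5)

def evolve(generations=200, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(-10.0, 10.0) for _ in range(population_size)]
    for _ in range(generations):
        children = [propose(rng.choice(population), rng)
                    for _ in range(population_size)]
        # Keep the best-scoring candidates from parents and children combined.
        ranked = sorted(population + children, key=evaluate, reverse=True)
        population = ranked[:population_size]
    return population[0]

best = evolve()
print(best, evaluate(best))
```

The design point is that the evaluator, not the proposer, decides what survives. That is why this style of system suits problems where a candidate can be scored automatically.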

📝 DeepMind's AI Co-Mathematician (May 2026)

Announced on 8 May 2026 by Pushmeet Kohli's team at Google DeepMind, the AI Co-Mathematician is a multi-agent system built on Gemini 3.1 Pro that actively collaborates with human mathematicians on open research problems.

On FrontierMath Tier 4 — the hardest 50 problems in the benchmark, designed to take expert mathematicians hours or days — the system solved 23 of 48 non-public problems, a 48% accuracy rate. The base model alone (Gemini 3.1 Pro) scored 19% on the same benchmark; the entire jump came from agentic scaffolding with parallel agents reviewing each other's work.
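A quick check of the arithmetic, plus a rough sense of how much uncertainty 48 problems leave. The sketch uses the Wilson score interval and assumes the problems behave like independent trials, which is our simplification and not a claim from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1.0 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return centre - half_width, centre + half_width

solved, attempted = 23, 48  # figures quoted in the text
accuracy = solved / attempted
low, high = wilson_interval(solved, attempted)
print(f"accuracy = {accuracy:.1%}, 95% interval = ({low:.1%}, {high:.1%})")
```

Under that assumption the interval is wide, but the base model's 19% still sits below its lower end.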

It has already helped close a real open problem: topologist Marc Lackenby used the system to resolve problem 21.10 from the Kourovka Notebook, an open compendium of group theory problems maintained continuously since 1965 in Novosibirsk.

Source: AI Co-Mathematician: Accelerating Mathematicians with Agentic AI (arXiv:2605.06651, 7 May 2026)

📝 IMO 2025 — Gold Medal in Natural Language

An advanced version of Gemini with Deep Think achieved gold-medal performance at the 2025 International Mathematical Olympiad: 5 out of 6 problems solved perfectly, 35/42 points, with end-to-end natural-language proofs (not formal Lean), within the official 4.5-hour competition window. This is up from 2024, when AlphaProof + AlphaGeometry 2 took silver but required expert formalisation into Lean and 2–3 days of computation. In one year, the formalisation overhead was removed and the time limit was met.

Centrepiece reading (revisited)

If you have not yet read it, return to Gowers (May 2026), A recent experience with ChatGPT 5.5 Pro. The Gowers post is the single most useful piece of writing I know on the current state of frontier-model mathematics, partly because Gowers is one of the few people who can credibly assess the quality of the work himself.

⚛ Theoretical Physics

The theoretical physics cases are particularly notable because they involve named senior physicists collaborating directly with AI on research-level calculations. The work has appeared as preprints with credentialed authors, not as AI-lab demonstrations.

📝 Single-Minus Gluon Amplitudes — OpenAI × Strominger Group

Guevara, A., Lupsasca, A., Skinner, D., Strominger, A., & Weil, K. (12 February 2026). Single-minus gluon tree amplitudes are nonzero. arXiv:2602.12176

The paper's own abstract is explicit about the AI contribution:

“The key formula was first conjectured by GPT-5.2 Pro and then proved by a new internal OpenAI model.”

Human verification was mandatory: the proof was checked by hand using the Berends–Giele recursion and tested against multiple consistency conditions including Weinberg's soft theorem. The author list includes Andrew Strominger (Harvard) — among the most distinguished living theoretical physicists — and Kevin Weil on behalf of OpenAI. This is not an AI-lab demonstration; it is a real preprint with a real result.

📝 Graviton Extension

The same group extended the gluon result to gravitons in March 2026. Currently hosted at openai.com/index/extending-single-minus-amplitudes-to-gravitons/ (PDF at cdn.openai.com/pdf/graviton.pdf); arXiv version not yet posted at time of writing. The extension used the gluon results as context plus some guidance from the human physicists; GPT-5.2 Pro constructed the analogous single-minus scattering amplitudes for gravitons.

📝 Cosmic-String Gravitational Radiation

Brenner, M. P., Cohen-Addad, V., & Woodruff, D. (5 March 2026). Solving an Open Problem in Theoretical Physics using AI-Assisted Discovery. arXiv:2603.04735

A neuro-symbolic system — Gemini Deep Think combined with a tree-search framework and automated numerical feedback — computed the power spectrum of gravitational radiation emitted by cosmic strings. The system discovered six different analytical methods; the most elegant uses Gegenbauer-polynomial expansions and is exact and efficient. Of growing relevance given recent Pulsar Timing Array observations of the stochastic gravitational background.
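For orientation, Gegenbauer polynomials C_n^(α)(x) are a standard family of orthogonal polynomials that includes the Legendre polynomials (α = 1/2) and the Chebyshev polynomials of the second kind (α = 1). The sketch below evaluates them with the textbook three-term recurrence. It illustrates the building block only and does not reproduce the paper's expansion; the function name is ours.

```python
def gegenbauer(n, alpha, x):
    """Evaluate C_n^(alpha)(x) using the standard three-term recurrence:

    C_0 = 1, C_1 = 2*alpha*x,
    k*C_k = 2*x*(k + alpha - 1)*C_{k-1} - (k + 2*alpha - 2)*C_{k-2}.
    """
    if n == 0:
        return 1.0
    prev, curr = 1.0, 2.0 * alpha * x
    for k in range(2, n + 1):
        nxt = (2.0 * x * (k + alpha - 1.0) * curr
               - (k + 2.0 * alpha - 2.0) * prev) / k
        prev, curr = curr, nxt
    return curr

# alpha = 1/2 gives Legendre polynomials: P_2(x) = (3x^2 - 1) / 2
print(gegenbauer(2, 0.5, 0.5))
# alpha = 1 gives Chebyshev polynomials of the second kind: U_3(x) = 8x^3 - 4x
print(gegenbauer(3, 1.0, 0.5))
```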

📝 Quantum Many-Body Calculations

Pan, H., Mudur, N., Taranto, W., Tikhanovskaya, M., Venugopalan, S., Bahri, Y., Brenner, M. P., & Kim, E.-A. (2025). Quantum many-body physics calculations with large language models. Communications Physics. DOI:10.1038/s42005-025-01956-y

The authors tested GPT-4 on 15 quantum many-body papers from the past decade. With multi-step prompt templates, the model correctly derived the final Hartree–Fock Hamiltonian in 13 of 15 cases. This was the first systematic evaluation of LLMs on research-level physics calculations, and the result was obtained before the GPT-5 generation. The lesson: pre-registered prompt templates and step-by-step verification matter as much as model capability.

🧠 Adjacent Paradigms (Not LLM-based)

Not all of the AI-in-science story is LLMs. Two notable 2025 results use different machine-learning paradigms entirely. They are worth knowing because the limits of LLMs are not the limits of AI.

AI-Newton (Symbolic Discovery)

Fang, Y.-L., Jian, D.-S., Li, X., & Ma, Y.-Q. (April 2025). AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge. arXiv:2504.01538. From Peking University. Symbolic regression that, given experimental data, autonomously rediscovers Newton's second law, energy conservation, and the law of gravitation — without prior physics knowledge built in.
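To make "symbolic regression" concrete, here is a deliberately tiny sketch: score a handful of candidate formulas against data and keep the one with the lowest error. Everything in it is an assumption for illustration. The data are synthetic and generated from F = ma with small noise, the candidate list is hand-written, and the search is brute force. AI-Newton's actual method is far more general and is not reproduced here.

```python
import random

# Hypothetical candidate formulas for force as a function of mass and acceleration.
CANDIDATES = {
    "m + a": lambda m, a: m + a,
    "m * a": lambda m, a: m * a,
    "m / a": lambda m, a: m / a,
    "m * a**2": lambda m, a: m * a ** 2,
    "m**2 * a": lambda m, a: m ** 2 * a,
}

def make_data(n=50, seed=0):
    """Synthetic 'measurements' of (mass, acceleration, force)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        m = rng.uniform(0.5, 5.0)
        a = rng.uniform(0.5, 5.0)
        force = m * a + rng.gauss(0.0, 0.01)  # small measurement noise
        rows.append((m, a, force))
    return rows

def mean_squared_error(formula, rows):
    return sum((formula(m, a) - f) ** 2 for m, a, f in rows) / len(rows)

def best_formula(rows):
    """Return the name of the candidate with the lowest mean squared error."""
    return min(CANDIDATES,
               key=lambda name: mean_squared_error(CANDIDATES[name], rows))

print(best_formula(make_data()))
```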

Physics-Tailored ML in Dusty Plasmas

Yu, W., Abdelaleem, E., Nemenman, I., & Burton, J. C. (2025). Physics-tailored machine learning reveals unexpected physics in dusty plasmas. PNAS 122(31), e2505725122. DOI:10.1073/pnas.2505725122. Custom neural network trained on dusty-plasma trajectory data infers non-reciprocal forces with R² ≈ 0.99. From Emory University. Demonstrates that physics-aware ML can rediscover laws and even reveal new ones from carefully-designed experiments.
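The R² figure is the coefficient of determination: the fraction of variance in the observations that the model's predictions account for. A minimal sketch of the standard definition follows, with made-up numbers rather than the paper's data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0]
print(r_squared(observed, [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared(observed, [2.0, 2.0, 2.0]))  # always predicting the mean
```

A perfect fit gives 1 and always predicting the mean gives 0, so a value near 0.99 means the model leaves very little of the variance unexplained.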

🤖 Autonomous Research Pipelines

📝 The AI Scientist v2 — First AI-Authored Peer-Reviewed Paper

Lu, C., Lu, C., Lange, R. T., et al. (26 March 2026). Towards end-to-end automation of AI research. Nature 651, 914–919. DOI:10.1038/s41586-026-10265-5

Sakana AI's second-generation AI Scientist (agentic tree search, no human-authored code templates) produced three manuscripts that were submitted to a peer-reviewed workshop at ICLR 2025. One manuscript passed peer review with an average reviewer score of 6.33, placing it in roughly the top 45% of submissions. Sakana withdrew the paper before final publication, but the milestone is real: it was the first fully AI-generated paper to pass rigorous human peer review. The paper includes important caveats about generality and replicability.

💻 Code, Writing, and Other Domains

Code at the Frontier

Claude Opus 4.7 at SWE-bench Verified 87.6%; DeepSeek V4 Pro at Codeforces 3206 (top of any model); GPT-5.5 Pro at MCP-Atlas 77.3%. Frontier models in 2026 are research-grade for many programming tasks. The Week 7 caveat applies: code that runs is not necessarily code that is correct. Verification and known-answer testing are still mandatory.
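Known-answer testing means running code on inputs whose correct output you already know from an independent source before trusting it on the case you care about. A minimal sketch: suppose a model wrote the `trapezoid` integrator below (a hypothetical example, not from any of the cited work). We check it against integrals with textbook values.

```python
import math

def trapezoid(f, a, b, n=1000):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def test_known_answers():
    # The integral of sin(x) from 0 to pi is exactly 2.
    assert math.isclose(trapezoid(math.sin, 0.0, math.pi), 2.0, abs_tol=1e-5)
    # The integral of x from 0 to 1 is exactly 1/2.
    assert math.isclose(trapezoid(lambda x: x, 0.0, 1.0), 0.5, abs_tol=1e-9)

test_known_answers()
print("known-answer checks passed")
```

Passing these checks does not prove the code correct. It only rules out the failures these particular cases can expose, so the cases should be chosen to stress the behaviour you depend on.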

Scientific Writing

Capability has improved substantially. Frontier models produce drafts that read like academic prose without obvious AI tells. The Week 6 caveats apply: AI suggestions homogenise writing toward Western styles (Agarwal, Naaman & Vashistha, CHI 2025); citation hallucination rates remain non-zero; the “writing as thinking” concern is more important now, not less.

Multimodal

Pointer to Week 8. Frontier models in 2026 can describe scientific figures, transcribe audio (with hallucination caveats), parse complex documents, and process video. The capability is real; the verification overhead remains substantial.

⚠️ Essential Caveats

Read these. Without them, the cases above can be misread as “AI now does mathematics and physics autonomously”. That is not the claim being made.

1. Selection Bias

Every case above is one where AI worked. The ratio of attempts-to-successes is unknown. For every Erdős problem solved, dozens may have been attempted with no result. For every gluon-amplitude formula correctly conjectured, an unknown number of incorrect ones were rejected. The Gowers post itself is honest about this: ChatGPT's initial outputs were “rambling” and required multiple iterations.

2. Human Collaboration Was Essential in Every Single Case

The gluon-amplitude paper required Strominger and colleagues to feed GPT-5.2 specific human-computed cases as a starting point. The Erdős solutions required follow-up work to formalise and extend. The IMO gold required problem-statements to be presented in their canonical form. The Gowers experiment required Gowers' expertise to evaluate what the model had produced. These are all collaborations, not autonomous discovery. Reading them as “AI did mathematics” misses the structure.

3. Human Verification Was Always Required

Even when the AI's contribution was substantial, the proof was checked by humans. The Lean formalisations of the Erdős solutions provided machine-checkable verification. The Berends–Giele recursion check on the gluon formula was done by hand. The cosmic-string-radiation result was tested numerically against existing partial solutions. Hallucination remains a real risk in any AI mathematical or physical output. The verification step is not optional.

4. Several of These Are Preprints, Not Yet Peer-Reviewed

The gluon and graviton papers are arXiv (and OpenAI-hosted) preprints. Several of the Erdős solutions are documented in forum posts and follow-up arXiv papers but have not all completed formal peer review. Some may not survive review unchanged. The status as of May 2026 is provisional, not consensus.

What this sub-lesson asks of you

Update your calibration. If your prior on “AI in mathematics” was based on Frieder et al. (2023), look at Gowers (2026), Erdős #1196, and the IMO 2025 gold. If your prior on “AI in theoretical physics” was “not really”, look at the gluon paper.

And: hold the calibration loosely. The structural failures from 9.2 are still operative in every case above. The ceiling has moved; the floor of verification has not.

In your own field: test current frontier models on tasks you assumed they couldn't do. The activities in 9.6 ask you to do exactly this.

👉 What Comes Next

Sub-Lesson 9.4 — Illusions of Understanding. Having now seen both what AI fails at (9.2) and what it's now genuinely good at (9.3), we turn to a deeper question: when AI is right, do you understand the result? The Messeri & Crockett (2024, Nature) framework is the centrepiece.